Overview

Subquestions:

1. What is the most frequent word used in popular songs?

4. Is there a pattern for those successful songs?

5. Is there a relationship between the mood of songs and Twitter users’ attitudes?

Data Collection and Data Cleaning

Music Data

This project collected music information and Twitter posts from several APIs.

billboard.py, a Python API for accessing music charts from Billboard.com, is used to collect song titles and artists’ names. Given that information, PyLyrics, a Python module that retrieves song lyrics from lyrics.wikia.com, is used to fetch the lyrics.

The biggest data-cleaning problem in this part is the special characters in the titles and artist lists:

Graph 1(Tableau)
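As a rough illustration of this kind of cleanup, here is a minimal sketch. The report does not list the exact symbols that were removed, so the separator patterns below ("Featuring", "&", commas, parenthesized suffixes) and the function names primary_artist and clean_title are assumptions made for the example, not the project's actual rules.

```python
import re

# Hypothetical separators: the report does not say which symbols it handled.
_ARTIST_SPLIT = re.compile(
    r"\s+(?:featuring|feat\.?|ft\.?)\s+|\s*[&,]\s*", re.IGNORECASE
)


def primary_artist(artists: str) -> str:
    """Return the first artist from a chart-style artist string."""
    return _ARTIST_SPLIT.split(artists.strip(), maxsplit=1)[0].strip()


def clean_title(title: str) -> str:
    """Drop parenthesized suffixes and punctuation, keep words and apostrophes."""
    title = re.sub(r"\([^)]*\)", " ", title)   # e.g. "(Remix)"
    title = re.sub(r"[^\w\s']", " ", title)    # remaining punctuation
    return re.sub(r"\s+", " ", title).strip()  # collapse whitespace
```

Splitting on "&" will also cut group names that contain it, so a real pipeline would probably need an exception list for such artists.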

In addition, some songs in the dataset are in languages other than English:

Graph 2(matplotlib), Source Code
Graph 3(matplotlib), Source Code

Very few songs use languages other than English. These songs are removed so that the sentiment analysis is more accurate.

Comparison of the data before and after cleaning:

Graph 4(Tableau)

Twitter Data

Twitter data cleaning

When cleaning the Twitter data, we first check whether the same Twitter post link appears more than once in our dataset; if it does, we remove the duplicate records. We then remove non-alphabetic symbols and hyperlinks from the text. The plot below provides statistics on the records we cleaned.

Graph 6(matplotlib), Source Code
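The cleaning steps described above can be sketched as follows. This is a minimal illustration, not the project's source code: the record layout (dictionaries with "link" and "text" keys) and the choice to keep the first occurrence of each duplicated link are assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^A-Za-z\s]")


def clean_text(text: str) -> str:
    """Strip hyperlinks first, then every non-alphabetic character."""
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace


def clean_tweets(records):
    """Drop records whose link was already seen and clean the text field.

    Assumes each record is a dict with hypothetical "link" and "text" keys.
    """
    seen = set()
    cleaned = []
    for rec in records:
        if rec["link"] in seen:
            continue  # duplicate post link: keep only the first occurrence
        seen.add(rec["link"])
        cleaned.append({**rec, "text": clean_text(rec["text"])})
    return cleaned
```

Hyperlinks are removed before the non-alphabetic filter; in the other order, the URL would be broken into stray letter fragments that stay in the text.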

Twitter sentiment analysis

To analyze the sentiment of tweets, we generate a new feature, the sentiment score, in the dataset. We build our sentiment analysis tool using Word2Vec and deep learning. We first obtain a training dataset from Sentiment140, which contains 1.4 million tweets labeled with sentiment scores. We then tokenize both the training data and our own Twitter data with the NLTK tokenizer and vectorize each token using a Gensim Word2Vec model with vector size 512 and window size 10. Finally, we construct a 7-layer convolutional neural network and train it on the training data.

With 10-fold cross-validation, the model reaches 75% accuracy. We then apply it to predict sentiment scores for the collected Twitter data, which yields the frequency distribution below:

Graph 7(matplotlib), Source Code
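For readers unfamiliar with 10-fold cross-validation, the splitting step can be sketched in plain Python as below. The function name k_fold_indices and the fixed seed are illustrative assumptions; the report does not say which tool produced its folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each sample lands in the test set exactly once. The model is retrained on each training split, and the reported accuracy would typically be the mean over the ten test folds.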

1. What are the most frequent words used in popular songs?

Let’s start by showing some EDA plots. The first sub-question simply asks which words are used most frequently in the lyrics. We first use a word cloud of all the songs in the Billboard Top 100 to study it.

Graph 8(matplotlib, Word Cloud), Source Code
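The counts behind a word cloud like the one above can be approximated with a simple frequency table. The sketch below is an assumption-laden illustration: the tokenizing regex, the tiny stop-word set, and the name top_words are all hypothetical, and the report does not say which stop words were excluded.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "i", "you", "to", "it",
             "in", "of", "my", "me", "is", "on", "that"}


def top_words(lyrics_list, n=10, stopwords=STOPWORDS):
    """Return the n most common non-stop-words across all lyrics."""
    counts = Counter()
    for lyrics in lyrics_list:
        words = re.findall(r"[a-z']+", lyrics.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)
```

For example, top_words(all_lyrics, n=20) would list the twenty most frequent words with their counts, where all_lyrics is a list with one lyrics string per song.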

5. Is there a relationship between the mood of songs and Twitter users’ attitudes?

The fifth sub-question is whether there is a relationship between the sentiment analysis results for tweets and lyrics. In other words, the question becomes “Do the lyrics’ attitudes influence people’s comments about the songs on Twitter?” and “Does the connection between the two attitudes influence the popularity of the songs?”

To answer these questions, we look at the sentiment analysis results for tweets and lyrics.

Graph 28(Seaborn), Source Code

The graph above is a two-dimensional plot of the sentiment attitude scores. There appears to be no linear relationship between the tweets’ attitude and the lyrics’ attitude, although we cannot be sure until we run some tests. Still, the graph offers one interesting piece of information. Most points are located at the top and bottom of the graph, which means most of the songs are either strongly negative or strongly positive. Those songs also receive more attention on Twitter than the neutral songs.

Next, let us take a look at each individual analysis result.

Graph 29(Seaborn), Source Code

As the boxplot above shows, lyrics have much higher attitude points than tweets. Most of the tweets’ attitude points are below neutral, which means they are more negative, while most song lyrics tend to be positive. However, we cannot yet be sure that the difference is significant, so we run a one-way ANOVA test to check.

Hypothesis test: one-way ANOVA

Hypothesis1, Source Code

Since the p-value shown above is below 0.05, we reject the null hypothesis: there is a significant difference between the lyrics’ attitude points and the tweets’ attitude points. Knowing that they differ, we move on to the next part and test whether there is a relationship between the lyrics’ attitude and the tweets’ attitude.

Hypothesis test: linear regression
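To make the one-way ANOVA step concrete, the F statistic it relies on can be computed by hand as in the sketch below. This is an illustration under assumptions, not the project's code: the function name one_way_anova_f is hypothetical, and in practice a library routine such as scipy.stats.f_oneway returns both the statistic and the p-value.

```python
from statistics import fmean


def one_way_anova_f(*groups):
    """Return (F statistic, between-group df, within-group df)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [fmean(g) for g in groups]
    grand_mean = fmean([x for g in groups for x in g])

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

Here the two groups would be the lyrics' attitude points and the tweets' attitude points. The p-value is the upper-tail probability of the F distribution with these degrees of freedom, and the null hypothesis of equal means is rejected when it falls below 0.05.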

Hypothesis2, Source Code

As the results above show, we included the number of weeks a song stayed on the chart as an additional term that may relate to the attitude points. All p-values are much greater than 0.05, so we fail to reject the null hypothesis: we find no evidence of a significant linear relationship between the tweets’ attitude points and either the lyrics’ attitude points or the number of weeks a song stayed on the chart (which can also be read as a measure of popularity).

Conclusion and Future Study